Here I will create a new SQLite database in my database folder, and then check if it exists.
#Load the packages.
library(DBI)
library(RSQLite)
#Create a new databse in my databse folder.
dir.create("database")
new_folder <- file.path(getwd(), "database")
db_path <- file.path(new_folder, "my_database.sqlite")
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = db_path)
my_db <- "my_database.sqlite"
#Check for the existence of my relational database.
setwd("database")
if (file.exists(my_db)) {
cat("The database file exists!\n")
} else {
cat("The database file does not exist.\n")
}
## The database file exists!
Here I will write a automatic function to extract a table of all R1 (Very High Research Activity) and R2 (High Research Activity) Research Institutions in the US from a Wikipedia page (https://en.wikipedia.org/wiki/List_of_research_universities_in_the_United_States). The table will include the names, control status, location and Wikipedia page url of those universities.
## # A tibble: 6 × 5
## Institution Control City State Wikipedia_Link
## <chr> <chr> <chr> <chr> <chr>
## 1 Arizona State University Public Tempe AZ https://en.wi…
## 2 Auburn University Public Auburn AL https://en.wi…
## 3 Baylor University Private (non-profit) Waco TX https://en.wi…
## 4 Binghamton University Public Vestal NY https://en.wi…
## 5 Boston College Private (non-profit) Chestnut H… MA https://en.wi…
## 6 Boston University Private (non-profit) Boston MA https://en.wi…
##Exercise 2.b Gathering unstructured data Here, I will write a new scraping function to gather the geographic coordinates, the endowment (USD) and the total number of students of each institution.
## Institution Control City State
## 1 Arizona State University Public Tempe AZ
## 2 Auburn University Public Auburn AL
## 3 Baylor University Private (non-profit) Waco TX
## 4 Binghamton University Public Vestal NY
## 5 Boston College Private (non-profit) Chestnut Hill MA
## 6 Boston University Private (non-profit) Boston MA
## Wikipedia_Link Coordinates
## 1 https://en.wikipedia.org/wiki/Arizona_State_University 33.4209°N 111.9340°W
## 2 https://en.wikipedia.org/wiki/Auburn_University 32.603°N 85.486°W
## 3 https://en.wikipedia.org/wiki/Baylor_University 31.548°N 97.116°W
## 4 https://en.wikipedia.org/wiki/Binghamton_University 42.0888°N 75.9670°W
## 5 https://en.wikipedia.org/wiki/Boston_College 42.33500°N 71.17028°W
## 6 https://en.wikipedia.org/wiki/Boston_University 42.34889°N 71.10028°W
## Endowment Total_Students
## 1 1.47e+09 142636
## 2 1.05e+09 33015
## 3 1.97e+09 20626
## 4 148100000 18148
## 5 3.3e+09 14890
## 6 3.2e+09 36729
In this part, I will add 3 new columns to my previous table based on the file “ivyleague.csv”. The new table will indicate every institution’s Ivy League membership, county and state, and the EIN number for Ivy League members.
## Institution Control City State
## 1 Arizona State University Public Tempe AZ
## 2 Auburn University Public Auburn AL
## 3 Baylor University Private (non-profit) Waco TX
## 4 Binghamton University Public Vestal NY
## 5 Boston College Private (non-profit) Chestnut Hill MA
## 6 Boston University Private (non-profit) Boston MA
## Wikipedia_Link Coordinates
## 1 https://en.wikipedia.org/wiki/Arizona_State_University 33.4209°N 111.9340°W
## 2 https://en.wikipedia.org/wiki/Auburn_University 32.603°N 85.486°W
## 3 https://en.wikipedia.org/wiki/Baylor_University 31.548°N 97.116°W
## 4 https://en.wikipedia.org/wiki/Binghamton_University 42.0888°N 75.9670°W
## 5 https://en.wikipedia.org/wiki/Boston_College 42.33500°N 71.17028°W
## 6 https://en.wikipedia.org/wiki/Boston_University 42.34889°N 71.10028°W
## Endowment Total_Students IvyLeague_Member County EIN
## 1 1.47e+09 142636 No <NA> NA
## 2 1.05e+09 33015 No <NA> NA
## 3 1.97e+09 20626 No <NA> NA
## 4 148100000 18148 No <NA> NA
## 5 3.3e+09 14890 No <NA> NA
## 6 3.2e+09 36729 No <NA> NA
Here I will write the final table generated before into my relational database, and define a function “checkTable()” to check for the existence and correct dimensionality of my written tables.
The difinition of the function and the output can be found as follows.
# Create a checking function.
checkTable <- function(database_name, table_name) {
# Connect to the database
#con <- dbConnect(RSQLite::SQLite(), dbname = database_name)
# Check if the table exists
if (dbExistsTable(con, table_name)) {
# Get the table information
table_info <- dbGetQuery(con, paste0("PRAGMA table_info(", table_name, ")"))
# Get number of rows and columns
table_dimensions <- dbGetQuery(con, paste0("SELECT COUNT(*) AS num_rows, COUNT(*) AS num_columns FROM ", table_name))
# Get column names
column_names <- table_info$name
# Output information
cat("Number of Rows:", table_dimensions$num_rows, "\n")
cat("Number of Columns:", table_dimensions$num_columns, "\n")
cat("Column Names:", column_names, "\n")
} else {
cat("The table does not exist.")
}
# Close the database connection
dbDisconnect(con)
}
con <- dbConnect(RSQLite::SQLite(), dbname = db_path)
checkTable("my_database.sqlite", "Exercise2_Universities_Data")
## Number of Rows: 279
## Number of Columns: 279
## Column Names: Institution Control City State Wikipedia_Link Coordinates Endowment Total_Students IvyLeague_Member County EIN
In this exercise, I will use the RSelenium package to fetch the ranking data from the Academic Ranking of World Universities (https://www.shanghairanking.com/).
Here, I will create a webscraper to gather the ARWU ranking for the Ivy League institutions across the years 2002, 2013 and 2023, and modify the ranking data to integer.
## Institution_Year Rank
## 1 University of Pennsylvania_2023 14
## 2 Brown University_2023 98
## 3 Columbia University_2023 8
## 4 Cornell University_2023 12
## 5 Dartmouth College_2023 350.5
## 6 Harvard University_2023 1
Call the function checkTable() to check for existence and dimensionality of the final table.
# Function to check table existence and dimensionality
con <- dbConnect(RSQLite::SQLite(), dbname = db_path)
checkTable("my_database.sqlite", "ARWU_ranks")
## Number of Rows: 24
## Number of Columns: 24
## Column Names: Institution_Year Rank
Here, I will scrape all social science subjects and their ranks for each Ivy League university in 2023.
## # A tibble: 6 × 2
## Institution_Subject Rank
## <chr> <chr>
## 1 University of Pennsylvania_Economics 14
## 2 Brown University_Economics 46
## 3 Columbia University_Economics 7
## 4 Cornell University_Economics 22
## 5 Dartmouth College_Economics 63
## 6 Harvard University_Economics 2
Call the function checkTable() to check for existence and dimensionality of the final table.
# Function to check table existence and dimensionality.
con <- dbConnect(RSQLite::SQLite(), dbname = db_path)
checkTable("my_database.sqlite", "Subject_ranks")
## Number of Rows: 112
## Number of Columns: 112
## Column Names: Institution_Subject Rank
Here, I will use the package “tidycensus” to access a raw API and gather financial data (2011 - 2021) as well as local economic data (2015 and 2020) for Ivy League members.
In the first table, I retrieved the Total Revenue and Total Assets of each Ivy League institution for the years 2011 - 2021.
## Institution_Year Total_Revenue Total_Assets
## 1 University of Pennsylvania_2021 9337742000 31220689000
## 2 University of Pennsylvania_2020 7624539000 24607104000
## 3 University of Pennsylvania_2019 7942744000 22754777000
## 4 University of Pennsylvania_2018 7238854000 21301691000
## 5 University of Pennsylvania_2017 6710820000 19054006000
## 6 University of Pennsylvania_2016 6286195000 17580872000
Call the function created in 2d to check for the existence as well as dimensionality of the final table. The output is as follows.
# Check for existence and dimensionality.
con <- dbConnect(RSQLite::SQLite(), dbname = db_path)
checkTable("my_database.sqlite", "Financial_data")
## Number of Rows: 82
## Number of Columns: 82
## Column Names: Institution_Year Total_Revenue Total_Assets
Based on the data collected before and the tidycensus package, I will here first retrieve all counties in the US and their estimated median household income for both 2015 and 2020. The output tables can be found as follows.
## # A tibble: 3,220 × 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 01001 Autauga County, Alabama B19013_001 51281 2391
## 2 01003 Baldwin County, Alabama B19013_001 50254 1263
## 3 01005 Barbour County, Alabama B19013_001 32964 2973
## 4 01007 Bibb County, Alabama B19013_001 38678 3995
## 5 01009 Blount County, Alabama B19013_001 45813 3141
## 6 01011 Bullock County, Alabama B19013_001 31938 5884
## 7 01013 Butler County, Alabama B19013_001 32229 1793
## 8 01015 Calhoun County, Alabama B19013_001 41703 925
## 9 01017 Chambers County, Alabama B19013_001 34177 2949
## 10 01019 Cherokee County, Alabama B19013_001 36296 1710
## # ℹ 3,210 more rows
## # A tibble: 3,221 × 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 01001 Autauga County, Alabama B19013_001 57982 4839
## 2 01003 Baldwin County, Alabama B19013_001 61756 2268
## 3 01005 Barbour County, Alabama B19013_001 34990 2909
## 4 01007 Bibb County, Alabama B19013_001 51721 6237
## 5 01009 Blount County, Alabama B19013_001 48922 2269
## 6 01011 Bullock County, Alabama B19013_001 33866 10094
## 7 01013 Butler County, Alabama B19013_001 44850 2891
## 8 01015 Calhoun County, Alabama B19013_001 50128 1964
## 9 01017 Chambers County, Alabama B19013_001 43875 3861
## 10 01019 Cherokee County, Alabama B19013_001 42509 3824
## # ℹ 3,211 more rows
Then, I will match the local economic data with the Ivy League data collected earlier, creating a new Ivy League universities table to show the counties where each institution is located and the local estimated median household income data for 2015 and 2020.
## Institution_Year County
## 1 University of Pennsylvania_2015 Philadelphia County, Pennsylvania
## 2 University of Pennsylvania_2020 Philadelphia County, Pennsylvania
## 3 Brown University_2015 Providence County, Rhode Island
## 4 Brown University_2020 Providence County, Rhode Island
## 5 Columbia University_2015 New York County, New York
## 6 Columbia University_2020 New York County, New York
## Estimated_median_household_income
## 1 38253
## 2 49127
## 3 49743
## 4 62323
## 5 72871
## 6 89812
After I have collected the data I need and written it into my relational database, I will then use the checkTable function defined in 2d to check the existence and correct dimensionality of the final table. The outcome is as follows.
# Check for the existence and correct dimensionality.
con <- dbConnect(RSQLite::SQLite(), dbname = db_path)
checkTable("my_database.sqlite", "Ivy_League_with_Economic_Data")
## Number of Rows: 16
## Number of Columns: 16
## Column Names: Institution_Year County Estimated_median_household_income
In this exercise, I will first collect a table indicating the Ivy League institutions’ rank and financial data. Then I will use 4 plots to interpret the relationships between these variables.
The following table illustrates: 1. Institution name 2. The average rank of the university across 2003, 2013, and 2023 3. The average rank of the university’s Economics, Political Science, and Sociology programs, if they were ranked 4. The current endowment per student (total endowment divided by total number of students), in USD 5. The average total revenue per student across the years 2015 - 2020, in USD 6. The average of the median household income for the County across the years 2015 and 2020, in USD
## Institution Avg_ranking Economics_Rank Political_Science_Rank
## 1 Brown University 71 46 63
## 2 Columbia University 9 7 8
## 3 Cornell University 12 22 32
## 4 Dartmouth College 218 63 45
## 5 Harvard University 1 2 2
## 6 Princeton University 7 6 3
## Sociology_Rank Endowment_per_Student Avg_Revenue_Per_Student
## 1 32 633093.5 125638.6
## 2 20 382381.7 156524.7
## 3 11 380459.6 181180.6
## 4 NA 1286319.6 209856.2
## 5 1 2345810.4 281756.9
## 6 14 4222694.0 328221.7
## Avg_median_household_income
## 1 56033
## 2 81342
## 3 56993
## 4 61619
## 5 95660
## 6 78055
In Figure 1, the line graph shows the ranks for 3 different social
subjects in terms of Economics, Political Science and Sociology, and the
overall average rank of every Ivy League universities in 2023. Overall,
the subject rankings are roughly in line with the overall average
rankings. But it is interesting to find that while Dartmouth College was
ranked the lowest among all the Ivy League institutions with a number of
over 200, it demonstrates a remarkable academic strength in Economics
and Political Science.
## Exercise 5a Plot 2 In Figure 2, we can observe the average rank and
endowment per student of each Ivy League universities. In terms of the
ranking data, Dartmouth College shows a great distance from other Ivy
League members, with a figure of 218, compared to other universities
that are ranked under 100. Notablly, Columbia University and Cornell
University have obtained prominent ranks of 9 and 12 respectively with
the lowest endowment per student among all the 8 institutions, while
other members such as Princeton University has accquired a ranking of 7
with over seven times of per student endowment from Columbia and
Cornell.
## Exercise 5a Plot 3 In Figure 3, the line chart illustrates the
average endowment per student of Ivy League institutions, and the
average median household income of where these institutions are located.
It is obvious that the local income is considerably less than the
endowment of the universities. Particularly noteworthy is the fact that
the endowment varies dramatically among the eight institutions as well,
with Columbia University and Cornell Univeristy showing the lowest point
and Princeton University the highest on the scale. That’s a notable
difference of over 7 times.
## Exercise 5a Plot 4 In Figure 4, the line graph reflects the average
median household income of where the Ivy League institutions are
located, as well as the average revenue per student of each university.
Overall, the average revenue per student of every institution is much
higher than the local average median household incom, with Yale
University being the most notable. In contrast to Yale, Brown University
demonstrate the minimal income disparity, with the lowest average per
student revenue of $ 125639 among all the 8 institutions.
## Exercise 5.b Visualisation of geographic data In this exercise, I
will first call into R from my relational database a table that includes
data needed for visualisation.
Since the follow-up visualisation work needs to use separate numerical types of latitude and longitude coordinates, I will use Rsql to convert the string-type coordinates collected earlier for further processing. The final obtained table is as follows. The table contains the information of the institution’s name, coordinate values, status, and Ivy League membership.
## Institution Latitude Longitude Status
## 1 Arizona State University 33.42090 -111.93400 Public
## 2 Auburn University 32.60300 -85.48600 Public
## 3 Baylor University 31.54800 -97.11600 Private (non-profit)
## 4 Binghamton University 42.08880 -75.96700 Public
## 5 Boston College 42.33500 -71.17028 Private (non-profit)
## 6 Boston University 42.34889 -71.10028 Private (non-profit)
## IvyLeague_Member
## 1 No
## 2 No
## 3 No
## 4 No
## 5 No
## 6 No
Next, I will use the data to generate a interactive mapping of research institutions in the United States.
As it is evident from Figure 5, Ivy League schools are concentrated in the northeastern part of the mainland U.S., where 75 percent of a coastal distribution is found. Overall, the R1 and R2 institutions are more densely located in the East than in the West of mainland US, and are more concentrated along the coast than inland, comprising a much larger share of public institutions than the privates.
It is also notable that in terms of the allocation of research resources, northwestern states such as Wyoming, Nevada, Montana, and South Dakota are particularly deprived.
To understand the geographic distribution patterns of research institutions I have observed, multiple perspectives and factors could be taken into concern. Initially, in my point of view, the economic and industrial factor is of prime importance. As European colonists first landed from the Atlantic coast of the northeastern United States, the northeastern industrial region became the earliest industrial area and the first growth area in the nation. The region was initially dominated by traditional industries such as coal, steel, machinery, automobiles, and chemicals, leading to the accumulation of capital and resources during the early period of history. The economic and industrial development thereby contributed to higher concentrations of educational resources in these areas due to collaboration opportunities, funding availability, and industry partnerships. Secondly, historical and cultural factors are indispensable. Many research institutions (especially the Ivy League members) have deep cultural roots shaped by the history. Combined with the colonial history of the mainland US, we may assume that the early settlement of colonisers and the presence of wealth families might influence the establishment of research institutions. Additionally, governmental policies, funding initiatives, and investments in education, research, and technology might have favored certain regions, leading to disparities in research resource allocation. To further investigate this case, more detailed data such as demographic statistics and policy information need to be involved.